
FIGURE 5.14
Attention-head view for (a) the full-precision BERT, (b) the fully binarized BERT baseline, and (c) BiBERT for the same input. BiBERT with Bi-Attention behaves similarly to the full-precision model, while the baseline suffers from indistinguishable attention caused by information degradation.

Since binarized representations have limited capabilities, the ideal binarized representation should preserve the given full-precision counterparts as much as possible, which means the mutual information between the binarized and full-precision representations should be maximized. When the deterministic sign function is applied to binarize BERT, this goal is equivalent to maximizing the information entropy H(B) of the binarized representation B [171], which is defined as

$$H(\mathbf{B}) = -\sum_{B} p(B)\,\log p(B), \qquad (5.26)$$

where B ∈ {−1, 1} is a random variable sampled from the binarized representation $\mathbf{B}$ with probability mass function p. Therefore, the information entropy of the binarized representation should be maximized to better preserve the full-precision counterparts and let the attention mechanism function well.
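To make Eq. (5.26) concrete, below is a minimal PyTorch-style sketch (the helper name binary_entropy and the tensor shapes are illustrative, not part of BiBERT) that estimates the information entropy of a sign-binarized tensor and shows why a roughly balanced split between −1 and +1 maximizes it, while an all-positive input collapses the entropy to zero.

```python
import torch

def binary_entropy(b: torch.Tensor) -> float:
    """Estimate the information entropy H(B) of a sign-binarized tensor with
    values in {-1, +1}, following Eq. (5.26)."""
    p_pos = (b > 0).float().mean()             # empirical p(B = +1)
    probs = torch.stack([p_pos, 1.0 - p_pos])  # probability mass over {+1, -1}
    probs = probs[probs > 0]                   # drop zero-probability outcomes to avoid log(0)
    return float(-(probs * probs.log()).sum())

# A roughly zero-mean input keeps +1 and -1 balanced after sign(),
# so the entropy approaches its maximum, log 2 ≈ 0.693.
x = torch.randn(4, 128)
print(binary_entropy(torch.sign(x)))           # close to 0.693

# A strictly positive input (e.g., softmax outputs) collapses to all +1,
# so the entropy degenerates to 0.
a = torch.softmax(torch.randn(4, 128), dim=-1)
print(binary_entropy(torch.sign(a)))           # 0.0
```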

As for the attention structure in full-precision BERT, the normalized attention weight obtained by softmax is essential. However, directly applying the binarization function causes a complete loss of information in the binarized attention weight. Specifically, since softmax(A) is regarded as following a probability distribution, all of its elements are positive, so the elements of the sign-binarized attention weight $\mathbf{B}^s_{\mathrm{A}}$ are all quantized to 1 (Fig. 5.14(b)) and the information entropy $H(\mathbf{B}^s_{\mathrm{A}})$ degenerates to 0. A common measure to alleviate this information degradation is to shift the distribution of the input tensors before applying the sign function, which is formulated as

$$\hat{\mathbf{B}}^s_{\mathrm{A}} = \mathrm{sign}\left(\mathrm{softmax}(\mathbf{A}) - \tau\right), \qquad (5.27)$$

where the shift parameter τ, also regarded as the threshold of binarization, is expected to maximize the entropy of the binarized $\hat{\mathbf{B}}^s_{\mathrm{A}}$ and is fixed during inference. Moreover, the attention weight obtained by the sign function is binarized to {−1, 1}, while the original attention weight lies in the normalized range [0, 1]. Negative attention weights in the binarized architecture are contrary to the intuition behind the existing attention mechanism and are also empirically shown to be harmful to the attention structure.
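As a rough illustration of the shift-and-sign scheme in Eq. (5.27), the sketch below assumes PyTorch; the function name and the choice of τ as the uniform-attention value 1/N are purely illustrative and are not the authors' entropy-maximizing rule. It shows how shifting by a threshold before sign() restores a mix of +1 and −1 entries, while also exposing the negative values that the text criticizes.

```python
import torch

def shifted_sign_binarize(attn_scores: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. (5.27): shift the normalized attention weights by a threshold tau,
    then apply sign(), mapping each weight to {-1, +1}."""
    probs = torch.softmax(attn_scores, dim=-1)   # normalized weights in (0, 1)
    return torch.sign(probs - tau)

scores = torch.randn(2, 4, 8, 8)                 # (batch, heads, queries, keys)
tau = 1.0 / scores.shape[-1]                     # illustrative threshold: uniform weight 1/N
b_hat = shifted_sign_binarize(scores, tau)
print(b_hat.unique())                            # typically tensor([-1., 1.]): negatives remain
```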

To mitigate the information degradation caused by binarization in the attention mechanism, the authors introduced an efficient Bi-Attention structure for fully binarized BERT, which maximizes the information entropy of binarized representations statistically and applies bitwise operations for fast inference. In detail, they proposed to binarize the attention weight to Boolean values, with the design driven by information entropy maximization. In Bi-Attention, the bool function is leveraged to binarize the attention score A, which is defined as

$$\mathrm{bool}(x) = \begin{cases} 1, & \text{if } x \ge 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (5.28)$$
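The following is a minimal sketch of the bool binarization in Eq. (5.28), assuming PyTorch and an illustrative function name; it maps the attention score elementwise to {0, 1}, so the binarized attention weight stays non-negative, in contrast to the sign-based variant above.

```python
import torch

def bool_binarize(attn_scores: torch.Tensor) -> torch.Tensor:
    """Eq. (5.28): bool(x) = 1 if x >= 0 else 0, applied elementwise to the
    attention score A; the result contains only the values {0, 1}."""
    return (attn_scores >= 0).to(attn_scores.dtype)

scores = torch.randn(2, 4, 8, 8)      # (batch, heads, queries, keys)
b_a = bool_binarize(scores)
print(b_a.unique())                   # tensor([0., 1.])
```

Because both operands of the binarized attention-value product then take only two values, the product can be carried out with bitwise operations, which is the fast-inference property the text attributes to Bi-Attention.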